House Price - Data Prep
- 1 Objectives
- 2 Dataset
- 3 How many data types are there?
- 4 Data summary
- 5 Selecting the integer variables
- 6 Data quality
- 7 Selected variables in df.fac
- 8 Putting everything in lists
- 9 Joining the data frames
- 10 Test data
- 11 Splitting the test set by type
- 12 Selecting the variables
- 13 Putting everything in lists
- 14 Joining the data frames
- 15 Exporting the clean data
1 Objectives
- Number of variables: 81
- Variable types
- Integer or discrete
- Numeric or double
- Categorical
- Qualitative
- Data quality
- Number of NA's per variable
- Creation of new variables, if needed
- Transformation of variables, if needed
2 Dataset
2.1 Training data
# Read table
df.train <- data.table::fread('../dados/train.csv',
sep=",",
showProgress = FALSE) %>%
data.frame(stringsAsFactors = F)
df.train
3 How many data types are there?
lapply(df.train,class) %>%
unlist %>% as.character() %>% table
## .
## character integer
## 43 38
3.1 Integer-type data
df.int <- df.train[,unlist(lapply(df.train,is.integer))]
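An equivalent selection can be written with dplyr's `select()` and the `where()` helper (assuming dplyr >= 1.0.0), which reads a bit more directly than indexing with `unlist(lapply(...))`:

```r
library(dplyr)

# Keep only the integer-typed columns of df.train
df.int <- df.train %>% select(where(is.integer))
```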
df.int
4 Data summary
skim_to_wide(df.int) %>% arrange(desc(missing))
Note that there are missing values in the sample. My strategy will be to remove variables with many missing values and to impute data for the variables with few missing values.
4.1 Pre-selection of the integer variables
set.seed(2510)
boruta.int <- Boruta(SalePrice ~ ., data=na.omit(df.int[,-1]), doTrace=2) # perform Boruta search
cols.int <- names(boruta.int$finalDecision[boruta.int$finalDecision %in% c("Confirmed", "Tentative")]) # collect confirmed and tentative variables
4.2 Variable importance plot
plot(boruta.int, cex.axis=.7, las=2, xlab="", main="Variable Importance")
5 Selecting the integer variables
df.int <- df.int[,c('Id',"SalePrice",cols.int)]
6 Data quality
6.1 Integer type: % of missing values
apply(is.na(df.int),2,
function(x) round(100*sum(as.numeric(x))/length(x),2)) %>%
sort(decreasing = T)
## LotFrontage GarageYrBlt MasVnrArea Id SalePrice
## 17.74 5.55 0.55 0.00 0.00
## MSSubClass LotArea OverallQual OverallCond YearBuilt
## 0.00 0.00 0.00 0.00 0.00
## YearRemodAdd BsmtFinSF1 BsmtUnfSF TotalBsmtSF X1stFlrSF
## 0.00 0.00 0.00 0.00 0.00
## X2ndFlrSF GrLivArea BsmtFullBath FullBath HalfBath
## 0.00 0.00 0.00 0.00 0.00
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageCars
## 0.00 0.00 0.00 0.00 0.00
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch ScreenPorch
## 0.00 0.00 0.00 0.00 0.00
Overall we can conclude that there are not many missing values among the integer variables: only LotFrontage has about 17.74% of its values missing, and since there is a series of variables more important than this one, I decided to remove it.
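The same decision can be automated with a missingness threshold. A minimal base-R sketch on toy data (the 50% cutoff is an arbitrary choice for illustration):

```r
# Toy data frame with varying amounts of missingness per column
df <- data.frame(a = c(1, NA, 3, NA), b = 1:4, c = c(NA, NA, NA, 4))

na_pct <- colMeans(is.na(df)) * 100   # % of NA's per column: a = 50, b = 0, c = 75
df_kept <- df[, na_pct <= 50, drop = FALSE]
names(df_kept)                        # "a" "b"
```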
6.2 Removing the LotFrontage variable
df.int <- df.int %>% select(-LotFrontage)
6.3 Imputing data with knn
Building a knn model to impute the data.
preProcess_missingdata_model <- preProcess(df.int[,-c(1,2)], method='knnImpute')
preProcess_missingdata_model
## Created from 1371 samples and 27 variables
##
## Pre-processing:
## - centered (27)
## - ignored (0)
## - 5 nearest neighbor imputation (27)
## - scaled (27)
Let us now use this model to predict the missing values in df.int.
df.int1 <- predict(preProcess_missingdata_model, newdata = df.int[,-c(1,2)] )
anyNA(df.int1)
## [1] FALSE
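One caveat: in caret, method='knnImpute' also centers and scales every column, so df.int1 is on the standardized scale, not in the original units. If the original units are needed later, they can in principle be recovered from the statistics stored in the preProcess object; the `$mean` and `$std` fields used below are caret internals, so treat this as a sketch and verify against your caret version:

```r
# Invert caret's center/scale transform column by column (assumed field names)
df.orig <- sweep(df.int1, 2, preProcess_missingdata_model$std, `*`)
df.orig <- sweep(df.orig, 2, preProcess_missingdata_model$mean, `+`)
```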
All missing values were imputed successfully; now let us join the Id and SalePrice columns back onto df.int1.
df.int <- cbind(df.int[,c(1,2)],df.int1)
Converting string data to categorical
df.fac <- df.train[,unlist(lapply(df.train,is.character))] %>%
apply(2,as.factor) %>%
data.frame()
6.4 Categorical type: % of missing values
apply(is.na(df.fac),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>%
sort(decreasing = T)
## PoolQC MiscFeature Alley Fence FireplaceQu
## 99.52 96.30 93.77 80.75 47.26
## GarageType GarageFinish GarageQual GarageCond BsmtExposure
## 5.55 5.55 5.55 5.55 2.60
## BsmtFinType2 BsmtQual BsmtCond BsmtFinType1 MasVnrType
## 2.60 2.53 2.53 2.53 0.55
## Electrical MSZoning Street LotShape LandContour
## 0.07 0.00 0.00 0.00 0.00
## Utilities LotConfig LandSlope Neighborhood Condition1
## 0.00 0.00 0.00 0.00 0.00
## Condition2 BldgType HouseStyle RoofStyle RoofMatl
## 0.00 0.00 0.00 0.00 0.00
## Exterior1st Exterior2nd ExterQual ExterCond Foundation
## 0.00 0.00 0.00 0.00 0.00
## Heating HeatingQC CentralAir KitchenQual Functional
## 0.00 0.00 0.00 0.00 0.00
## PavedDrive SaleType SaleCondition
## 0.00 0.00 0.00
Note that the variables PoolQC, MiscFeature, Alley, Fence and FireplaceQu leave us no alternative but to remove them from the dataset: with this much missingness, imputation would propagate a great deal of error, given the uncertainty associated with imputation methods.
6.5 String-type variable selection:
- PoolQC
- MiscFeature
- Alley
- Fence
- FireplaceQu
df.fac <- df.fac %>%
select(-PoolQC,-MiscFeature,
-Alley,-Fence,-FireplaceQu)
df.fac
6.6 Pre-selection of the categorical variables
set.seed(2510)
boruta_fac <- Boruta(SalePrice ~ ., data=na.omit(data.frame(SalePrice = df.int[,'SalePrice'],df.fac)), doTrace=2) # perform Boruta search
cols.fac <- names(boruta_fac$finalDecision[boruta_fac$finalDecision %in% c("Confirmed", "Tentative")])
6.7 Variable importance plot
plot(boruta_fac, cex.axis=.7, las=2, xlab="", main="Variable Importance")
7 Selected variables in df.fac
df.fac <- df.fac[,cols.fac]
apply(is.na(df.fac),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>%
sort(decreasing = T)
## GarageType GarageFinish GarageCond BsmtExposure BsmtQual
## 5.55 5.55 5.55 2.60 2.53
## BsmtCond BsmtFinType1 MasVnrType Electrical MSZoning
## 2.53 2.53 0.55 0.07 0.00
## LotShape LandContour LandSlope Neighborhood Condition1
## 0.00 0.00 0.00 0.00 0.00
## BldgType HouseStyle RoofStyle RoofMatl Exterior1st
## 0.00 0.00 0.00 0.00 0.00
## Exterior2nd ExterQual ExterCond Foundation HeatingQC
## 0.00 0.00 0.00 0.00 0.00
## CentralAir KitchenQual Functional PavedDrive SaleCondition
## 0.00 0.00 0.00 0.00 0.00
Since we still have missing values, we turn to caret to impute these categories.
To build a model that imputes each category, we remove all the other variables that have any percentage of missing values, leaving only one of them in each model.
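The pattern for each variable is the same: fit a classifier on the rows where the category is observed, then predict it for the rows where it is missing. A self-contained sketch of the idea on toy data, with rpart standing in for the random forest used below:

```r
library(rpart)

set.seed(1)
toy <- data.frame(
  color = factor(sample(c("red", "blue"), 50, replace = TRUE)),
  size  = factor(sample(c("S", "M", "L"), 50, replace = TRUE))
)
toy$size[c(3, 10)] <- NA                              # introduce missing categories

# Fit on complete rows, predict the missing categories, fill them in
fit  <- rpart(size ~ color, data = na.omit(toy), method = "class")
pred <- predict(fit, newdata = toy[is.na(toy$size), ], type = "class")
toy$size[is.na(toy$size)] <- pred
anyNA(toy$size)                                       # FALSE
```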
7.1 Creating the data frames
df.GarageType <- df.fac %>% select(-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-Electrical)
df.GarageFinish <- df.fac %>% select(-GarageType,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-Electrical)
df.GarageCond <- df.fac %>% select(-GarageType,-GarageFinish,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-Electrical)
df.BsmtExposure <- df.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-Electrical)
df.BsmtQual <- df.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtCond,-BsmtFinType1,-MasVnrType,-Electrical)
df.BsmtCond <- df.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtFinType1,-MasVnrType,-Electrical)
df.BsmtFinType1 <- df.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-MasVnrType,-Electrical)
df.MasVnrType <- df.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-Electrical)
df.Electrical <- df.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType)
7.2 Using Random Forest
set.seed(12345)
fitControl <- trainControl(method="cv",
number=3,
savePredictions = 'final',
classProbs= F,
summaryFunction = multiClassSummary)
set.seed(12345)
# Train the model using rf
rf.GarageType = train(GarageType ~ ., data= na.omit(df.GarageType),
tuneLength=5,trControl = fitControl,method='rf')
rf.GarageFinish = train(GarageFinish ~ ., data= na.omit(df.GarageFinish),
tuneLength=5,trControl = fitControl,method='rf')
rf.GarageCond = train(GarageCond ~ ., data= na.omit(df.GarageCond),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtExposure = train(BsmtExposure ~ ., data= na.omit(df.BsmtExposure),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtQual = train(BsmtQual ~ ., data= na.omit(df.BsmtQual),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtCond = train(BsmtCond ~ ., data= na.omit(df.BsmtCond),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtFinType1 = train(BsmtFinType1 ~ ., data= na.omit(df.BsmtFinType1),
tuneLength=5,trControl = fitControl,method='rf')
rf.MasVnrType = train(MasVnrType ~ ., data= na.omit(df.MasVnrType),
tuneLength=5,trControl = fitControl,method='rf')
rf.Electrical = train(Electrical ~ ., data= na.omit(df.Electrical),
tuneLength=5,trControl = fitControl,method='rf')
The function below automates imputing the predicted categories into each variable's missing values.
f.pred <- function(fac, df.var, rf.model, var){
  new.df <- df.var[is.na(df.var[, var]), !(names(df.var) %in% var)]
  pred_rf <- predict(rf.model, newdata = new.df)
  fac[is.na(fac[, var]), var] <- pred_rf
  return(fac)  # return the updated data frame, not the unchanged global df.fac
}
8 Putting everything in lists
list.rfvars <- list(rf.GarageType,rf.GarageFinish,
rf.GarageCond,rf.BsmtExposure,
rf.BsmtQual,rf.BsmtCond,
rf.BsmtFinType1,rf.MasVnrType,
rf.Electrical)
list.dfvars <- list(df.GarageType,df.GarageFinish,
df.GarageCond,df.BsmtExposure,
df.BsmtQual,df.BsmtCond,
df.BsmtFinType1,df.MasVnrType,
df.Electrical)
vars <- c('GarageType','GarageFinish',
'GarageCond','BsmtExposure',
'BsmtQual','BsmtCond',
'BsmtFinType1','MasVnrType',
'Electrical')
Imputing the categorical values
for(j in 1:9) df.fac <- f.pred(df.fac,list.dfvars[[j]],list.rfvars[[j]],vars[j])
9 Joining the data frames
Joining the integer and string data.
df.train <- bind_cols(df.int,df.fac)
df.train
Our df.train is now clean and ready to be explored.
10 Test data
We will apply to the test data the same transformations applied to the training data.
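A caveat worth flagging: the steps below fit brand-new imputation models on the test set itself. An alternative, often preferred so that test data are transformed exactly like the training data, is to reuse the training-set objects. A sketch, assuming the preProcess_missingdata_model from section 6.3 is still in scope and that the integer test columns have been prepared as below:

```r
# Hypothetical reuse of the training preProcess model on the test predictors,
# instead of fitting a new knnImpute model on the test set
test.int1 <- predict(preProcess_missingdata_model, newdata = test.int[,-1])
```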
df.test <- data.table::fread('../dados/test.csv',
sep=",",
showProgress = FALSE) %>%
data.frame(stringsAsFactors = F)
df.test
11 Splitting the test set by type
11.1 Integer-type data
test.int <- df.test[,unlist(lapply(df.test,class)) %in% "integer"]
test.int
We will also impute data in the test.int sample.
apply(is.na(test.int),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>%
sort(decreasing = T)
## LotFrontage GarageYrBlt MasVnrArea BsmtFullBath BsmtHalfBath
## 15.56 5.35 1.03 0.14 0.14
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF GarageCars
## 0.07 0.07 0.07 0.07 0.07
## GarageArea Id MSSubClass LotArea OverallQual
## 0.07 0.00 0.00 0.00 0.00
## OverallCond YearBuilt YearRemodAdd X1stFlrSF X2ndFlrSF
## 0.00 0.00 0.00 0.00 0.00
## LowQualFinSF GrLivArea FullBath HalfBath BedroomAbvGr
## 0.00 0.00 0.00 0.00 0.00
## KitchenAbvGr TotRmsAbvGrd Fireplaces WoodDeckSF OpenPorchSF
## 0.00 0.00 0.00 0.00 0.00
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
## 0.00 0.00 0.00 0.00 0.00
## MoSold YrSold
## 0.00 0.00
12 Selecting the variables
Removing the same variables as in the training set.
test.int <- test.int[,c('Id',cols.int)] %>%
select(-LotFrontage)
preProcess_missingdata_model <- preProcess(test.int[,-1], method='knnImpute')
preProcess_missingdata_model
## Created from 1365 samples and 27 variables
##
## Pre-processing:
## - centered (27)
## - ignored (0)
## - 5 nearest neighbor imputation (27)
## - scaled (27)
Let us now use this model to predict the missing values in test.int.
test.int1 <- predict(preProcess_missingdata_model, newdata = test.int[,-1] )
anyNA(test.int1)
## [1] FALSE
test.int <- data.frame(Id = test.int$Id,test.int1)
12.1 Converting string data to categorical
test.fac <- df.test[,unlist(lapply(df.test,class)) %in% "character"] %>%
apply(2,as.factor) %>% data.frame()
Removing the same columns as in the training set.
test.fac <- test.fac %>% select(-PoolQC,-MiscFeature,
-Alley,-Fence,-FireplaceQu)
test.fac <- test.fac[,cols.fac]
Checking the percentage of missing values.
apply(is.na(test.fac),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>%
sort(decreasing = T)
## GarageFinish GarageCond GarageType BsmtCond BsmtQual
## 5.35 5.35 5.21 3.08 3.02
## BsmtExposure BsmtFinType1 MasVnrType MSZoning Functional
## 3.02 2.88 1.10 0.27 0.14
## Exterior1st Exterior2nd KitchenQual LotShape LandContour
## 0.07 0.07 0.07 0.00 0.00
## LandSlope Neighborhood Condition1 BldgType HouseStyle
## 0.00 0.00 0.00 0.00 0.00
## RoofStyle RoofMatl ExterQual ExterCond Foundation
## 0.00 0.00 0.00 0.00 0.00
## HeatingQC CentralAir Electrical PavedDrive SaleCondition
## 0.00 0.00 0.00 0.00 0.00
Again we turn to caret to impute these categories.
To build a model that imputes each category, we remove all the other variables that have any percentage of missing values, leaving only one of them in each model.
12.2 Creating the data frames
df.GarageType <- test.fac %>% select(-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.GarageFinish <- test.fac %>% select(-GarageType,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.GarageCond <- test.fac %>% select(-GarageType,-GarageFinish,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.BsmtExposure <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.BsmtQual <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.BsmtCond <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.BsmtFinType1 <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.MasVnrType <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MSZoning,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.MSZoning <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-Functional,-Exterior1st,-Exterior2nd,-KitchenQual)
df.Functional <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Exterior1st,-Exterior2nd,-KitchenQual)
df.Exterior1st <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior2nd,-KitchenQual)
df.Exterior2nd <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-KitchenQual)
df.KitchenQual <- test.fac %>% select(-GarageType,-GarageFinish,-GarageCond,-BsmtExposure,-BsmtQual,-BsmtCond,-BsmtFinType1,-MasVnrType,-MSZoning,-Functional,-Exterior1st,-Exterior2nd)
12.3 Using Random Forest
set.seed(12345)
fitControl <- trainControl(method="cv",
number=3,
savePredictions = 'final',
classProbs= F,
summaryFunction = multiClassSummary)
set.seed(12345)
# Train the model using rf
rf.GarageType = train(GarageType ~ ., data= na.omit(df.GarageType),
tuneLength=5,trControl = fitControl,method='rf')
rf.GarageFinish = train(GarageFinish ~ ., data= na.omit(df.GarageFinish),
tuneLength=5,trControl = fitControl,method='rf')
rf.GarageCond = train(GarageCond ~ ., data= na.omit(df.GarageCond),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtExposure = train(BsmtExposure ~ ., data= na.omit(df.BsmtExposure),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtQual = train(BsmtQual ~ ., data= na.omit(df.BsmtQual),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtCond = train(BsmtCond ~ ., data= na.omit(df.BsmtCond),
tuneLength=5,trControl = fitControl,method='rf')
rf.BsmtFinType1 = train(BsmtFinType1 ~ ., data= na.omit(df.BsmtFinType1),
tuneLength=5,trControl = fitControl,method='rf')
rf.MasVnrType = train(MasVnrType ~ ., data= na.omit(df.MasVnrType),
tuneLength=5,trControl = fitControl,method='rf')
rf.MSZoning = train(MSZoning ~ ., data= na.omit(df.MSZoning),
tuneLength=5,trControl = fitControl,method='rf')
rf.Functional = train(Functional ~ ., data= na.omit(df.Functional),
tuneLength=5,trControl = fitControl,method='rf')
rf.Exterior1st = train(Exterior1st ~ ., data= na.omit(df.Exterior1st),
tuneLength=5,trControl = fitControl,method='rf')
rf.Exterior2nd = train(Exterior2nd ~ ., data= na.omit(df.Exterior2nd),
tuneLength=5,trControl = fitControl,method='rf')
rf.KitchenQual = train(KitchenQual ~ ., data= na.omit(df.KitchenQual),
tuneLength=5,trControl = fitControl,method='rf')
The function below automates imputing the predicted categories into each variable's missing values.
f.pred <- function(fac, df.var, rf.model, var){
  new.df <- df.var[is.na(df.var[, var]), !(names(df.var) %in% var)]
  pred_rf <- predict(rf.model, newdata = new.df)
  fac[is.na(fac[, var]), var] <- pred_rf
  return(fac)
}
13 Putting everything in lists
list.rfvars <- list(rf.GarageType,rf.GarageFinish,
rf.GarageCond,rf.BsmtExposure,
rf.BsmtQual,rf.BsmtCond,
rf.BsmtFinType1,rf.MasVnrType,
rf.MSZoning,rf.Functional,
rf.Exterior1st,rf.Exterior2nd,rf.KitchenQual)
list.dfvars <- list(df.GarageType,df.GarageFinish,
df.GarageCond,df.BsmtExposure,
df.BsmtQual,df.BsmtCond,
df.BsmtFinType1,df.MasVnrType,
df.MSZoning,df.Functional,
df.Exterior1st,df.Exterior2nd,df.KitchenQual)
vars <- c('GarageType','GarageFinish',
'GarageCond','BsmtExposure',
'BsmtQual','BsmtCond',
'BsmtFinType1','MasVnrType',
'MSZoning','Functional','Exterior1st',
'Exterior2nd','KitchenQual')
Imputing the categorical values
for(j in 1:length(vars)) test.fac <- f.pred(test.fac,list.dfvars[[j]],list.rfvars[[j]],vars[j])
14 Joining the data frames
Joining the integer and categorical data.
df.test <- bind_cols(test.int,test.fac)
df.test
Now, with 1319 rows, our df.test is clean and ready to be explored.
15 Exporting the clean data
write.csv(df.train,'../outputs/df.train.csv')
write.csv(df.test,'../outputs/df.test.csv')
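One detail: by default `write.csv` writes the row names as an extra, unnamed first column, which shows up as a spurious `X` column when the files are read back. If that is not wanted:

```r
write.csv(df.train, '../outputs/df.train.csv', row.names = FALSE)
write.csv(df.test,  '../outputs/df.test.csv',  row.names = FALSE)
```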